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Abstract 

Peer-to-Peer (P2P) content sharing systems are suscepti- 
ble to the content pollution attack, in which attackers ag- 
gressively inject polluted contents into the systems to re- 
duce the availability of authentic contents, thus decreas- 
ing the confidence of participating users. 

In this paper, we design a pollution-free P2P con- 
tent sharing system, Green, by exploiting the inherent 
content-based information and the social-based reputa- 
tion. In Green, a content provider (i.e., creator or sharer) 
publishes the information of his shared contents to a 
group of content maintainers self-organized in a security 
overlay for providing the mechanisms of redundancy and 
reliability, so that a content requestor can obtain and filter 
the information of his requested content from the associ- 
ated maintainers. We employ a reputation model to help 
the requestor better identify the polluted contents, and 
then utilize the social (friend-related) information to en- 
hance the effectiveness and efficiency of our reputation 
model. Now, the requestor could easily select an authen- 
tic content version for downloading. While download- 
ing, each requestor performs a realtime integrity verifi- 
cation and takes prompt protection to handle the content 
pollution. To further improve the system performance, 
we devise a scalable probabilistic verification scheme. 

Green is broadly applicable for both structured and un- 
structured overlay applications, and moreover, it is able 
to defeat various kinds of content pollution attacks with- 
out incurring significant overhead on the participating 
users. The evaluation in massive-scale networks vali- 
dates the success of Green against the content pollution. 

1 Introduction 

Peer-to-Peer (P2P) content sharing systems have experi- 
enced an explosive growth since their inception in 1999, 
and now dominate large fractions of both the Internet 
users and traffic volume lfl9l , However, due to the de- 



centralized and unauthentic ated nature, each participat- 
ing user has to manage the potential risks involved in the 
application transactions without adequate experience and 
knowledge about other users. 

Recently, many measurement studies reported that 
content pollution is pervasive in P2P content sharing sys- 
tems, e.g., more than 50% of the copies of popular songs 
are polluted in KaZaA l20l . In a typical content pollu- 
tion attack, the polluters first collectively tamper with the 
target content, degrading its quality or rendering it un- 
usable. Then, they inject a massive number of tampered 
(i.e., polluted) contents into the system. Unable to distin- 
guish between authentic contents and polluted contents, 
genuine users download the polluted contents unsuspect- 
ingly into their own shared folders, from which other 
users may download later without knowing that these 
contents have been polluted, thus passively contributing 
to the dissemination of pollution. As a result, the pol- 
luted contents spread through the whole system at ex- 
traordinary speed. Such content pollution has the detri- 
mental effects of reducing the availability of the original 
authentic content, and ultimately, decreasing the confi- 
dence of users in the P2P content sharing system. 

In general, the simple digital signature scheme could 
be used to defend against content pollution in many Inter- 
net applications. Nevertheless, due to the lack of central- 
ized trusted authorities in P2P content sharing systems, 
the applicability of the signature scheme is questionable. 
So far, many pollution defenses have been further pro- 
posed, and most of them are built upon reputation mod- 
els E |U HO] Hi] El E2 EH . These models utilize the 
information derived from the participating users' feed- 
back; however, they are penalized by the lack of reliable 
user cooperation, and cannot defend against one concrete 
pollution attack: identifier corruption (described in sec- 
tion 13 .2\ . In this paper, we design Green, a pollution- 
free P2P content sharing system. Our main motivation 
is to provide a total defense against various kinds of ex- 
isting pollution attacks by exploiting not only the tradi- 



tional reputation-based information but also the inherent 
content-based information and social-based information. 

Firstly, in Green, a content provider (i.e., creator or 
sharer) publishes his shared contents' information to a 
group of content maintainers self-organized in a security 
overlay for providing the mechanisms of redundancy and 
reliability. This allows a content requestor to obtain the 
authentic information of his requested content by first 
looking up in the overlay for the associated maintain- 
ers, and then executing a proactive filtering process to 
filter out the malicious information corrupted by mali- 
cious and compromised maintainers. 

Secondly, after the proactive information filtering, a 
requestor resorts to the reputation-based information to 
better identify the polluted contents. Here, the requestor 
collects the past vote histories of the users who have 
voted the requested content, so that he could utilize the 
statistical correlation between his local vote history and 
each of these collected vote histories to compute the rep- 
utation score of any version of the requested content. 

Thirdly, in order to enhance the effectiveness and effi- 
ciency of our basic reputation model, a requestor further 
utilizes the social (friend-related) information. Specifi- 
cally, the requestor uses his friends' vote histories to ex- 
tend the local vote history reliably. By considering these 
extended vote histories, each requestor is able to per- 
form the reputation computation more accurately; fur- 
thermore, if many friends have already voted the re- 
quested content, the requestor could use these associated 
friends' votes to efficiently estimate the reputation score. 

Finally, according to the authentic content-based in- 
formation as well as the social-based reputation infor- 
mation, the requestor selects an authentic version of his 
requested content for downloading. In Green, we allow a 
requestor to verify the integrity of the requested content 
while downloading and take prompt protection actions to 
handle content pollution attacks. To reduce the verifica- 
tion overhead, we devise a scalable probabilistic verifi- 
cation scheme in which each requestor has the capability 
of flexibly adjusting the false positive rate of integrity 
verification according to the tradeoff between integrity 
assurance and verification overhead. 

Green is broadly applicable for both structured and un- 
structured overlay applications. Moreover, it is able to 
defend against various kinds of content pollution attacks 
without incurring significant overhead on the participat- 
ing users. We implemented a prototype system, and eval- 
uate its performance on two massive-scale testbeds with 
realistic network traces. The evaluation results illustrate 
that Green can effectively and efficiently defend against 
content pollution in various scenarios. 

The rest of this paper is organized as follows. We spec- 
ify the system model and terminology in section [2] fol- 
lowed by a description of threat model in section [3] The 



details of our Green system are elaborated in section [4] 
We then analyze the defense capacity and maintenance 
overhead of Green in section [5] Section [6] presents the 
experimental design and discusses the evaluation results. 
Finally, we give an overview of related work in section|7] 
and conclude this paper in section [8] 

2 System Model and Terminology 

Current P2P overlay networks can generally be grouped 
into two categories [21|: structured overlay networks, 
e.g., Chord [31] and Pastry [30], that accurately build an 
underlying topology to support rapid searching, and un- 
structured overlay networks, e.g., Gnutella-like HI net- 
works, that merely impose loose structure on the topol- 
ogy. In Green, for providing the mechanisms of redun- 
dancy and reliability, we choose a structured P2P overlay 
network as the underlying structure. Without loss of gen- 
erality, we utilize Chord as the dedicated underlying P2P 
overlay network, and we can also conveniently utilize an- 
other structured overlay as an alternative. Nowadays, the 
unstructured overlay applications are prevalent on the In- 
ternet; to illustrate the broad applicability, we will further 
describe how to deploy Green into unstructured overlays 
in section |4~6l 

Generally, a P2P content sharing system is composed 
of contents and users. To elaborate our design clearly, 
we introduce some related terminology about contents 
and users in the following two sections, respectively. 

2.1 Content-Related Terminology 

We refer to a specific content (e.g., a document, video 
or song) existing in the P2P content sharing system as a 
title. A given title can generally have multiple different 
versions, each of which is published by a group of users. 
For instance, a large number of versions of the same 
video could be produced by various rippers/encoders, 
each of which may create a slightly different version. 
Specifically, each version has an identifier, which is typ- 
ically a hash value of the associated data (and meta- 
data) of the version; therefore, modifications of data (and 
metadata) no matter faithfully or maliciously could also 
produce additional different versions. Finally, each ver- 
sion may have many copies in the system due to the fact 
that participating users could lookup and download vari- 
ous versions of different titles from each other. 

2.2 User- Related Terminology 

In general, there is no centralized infrastructure in P2P 
content sharing systems, and each participating user can 
play one or several of the following three roles: provider, 
requestor and maintainer. 
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A provider publishes the information of his shared 
copies into the P2P content sharing system, and informs 
the associated maintainer(s) of building up or refreshing 
the corresponding index structure. On the other hand, a 
requestor who wants to obtain a copy of a specific title 
can utilize the underlying overlay to route a query to- 
wards the maintainer(s) associated with the requested ti- 
tle, then these maintainer(s) should respond with a list of 
corresponding index records. Once receiving the index 
records, the requestor could filter out malicious records 
and select one version to download from the associated 
providers in parallel. 

3 Threat Model 

To better understand the characteristics of pollution at- 
tack and design a more effective pollution defense, we 
discuss two attack forms of content pollution: decoy in- 
sertion [20 1 and identifier corruption [6 |. 

3.1 Decoy Insertion 

Decoy insertion is a common pollution mechanism uti- 
lized by polluters in today's P2P content sharing systems. 
Here, the polluters target a specific title, and then man- 
ufacture decoys (i.e., corrupted versions with the same 
metadata but different identifiers) for the title in various 
ways, e.g., for media contents, inserting advertisements, 
cutting the duration, and replacing parts of the actual data 
with undecodable white noise. Afterwards, the polluters 
insert a massive number of decoys of the target title into 
the system in order to reduce the availability of authentic 
versions of the title severely. 

In general, when a content requestor searches for a ti- 
tle, the associated maintainer(s) should group the infor- 
mation of the requested title's available copies into dif- 
ferent versions, and then return the grouped result back 
to the requestor. Unable to distinguish between authen- 
tic versions and polluted versions, the requestor would 
select the version with the largest number of copies to 
download; unfortunately, under collective decoy inser- 
tion attacks, this selected version may be polluted. 

3.2 Identifier Corruption 

In a P2P content sharing system, each version is asso- 
ciated with an identifier, which is typically generated by 
applying a hash function to the data (and metadata) of the 
version. The generated version identifier can be used by 
content requestors to identify different versions of a spe- 
cific title, and moreover, this identifier is also important 
for maintainers to group the information of the requested 
title's available copies into different versions. 



(Information) 
Maintainer(s) 




Friend(s) 

Figure 1: System illustration. Note that, each user prac- 
tically plays one or several of these roles simultaneously. 

Due to the feature of hash function, it is usually as- 
sumed that different data (and metadata) would generate 
different identifiers. However, it is still possible to have 
two actually different versions with the same identifer, 
especially for the consideration of efficiency, when some 
widely used P2P content sharing systems adopt weak 
hash functions based only on a fraction of the data (and 
metadata) of a version, e.g., the UUHash (4) function 
employed by KaZaA [2 |. So now, the polluters could an- 
alyze the adopted hash function, and corrupt the fraction 
of data (and metadata) that are not used without altering 
the identifier of the corrupted version. 

Under identifier corruption attacks, when a requestor 
searches for a title, he can receive the grouped informa- 
tion of the requested title's versions from the maintain- 
ers, where each version is associated with an identifier. 
As usual, the requestor selects one version and down- 
loads different blocks of the selected version from differ- 
ent associated providers in parallel. However, if a down- 
loaded block is a corrupted part from the changed ver- 
sion, the entire downloaded version would be polluted. 

4 Design of Green 

In different P2P content sharing systems, each title may 
be associated with a topic, a file or a keyword, depending 
on different system designs. Without loss of generality, 
we use "file" to replace "title" hereafter for clarity. 

In Green, a provider publishes the information of his 
shared files to a group of maintainers self-organized in 
a structured overlay (section 14. Il l, so that a requestor 
could obtain/filter the information of the requested file 
from the associated maintainers (section 14.21 1. After- 
wards, the requestor utilizes a social-based reputation 
model to identify the polluted versions of his requested 
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Table 1 : Design Notation 



Symbol 


Meaning 


F 


File 


Vi 


The i th version of the requested file 


Vh 


The identifier of the version Vi 


D 


The digest of a data block 


DLi 


The digest list of the version Vi 


U 


User 


UI 


User identifier 


P 


Provider 


PI 


Provider identifier 


PILi 


The provider identifier list of the version Vi 


R 


Requestor 


Mi 


The i th maintainer of a file 


01 


Voter identifier 


OIL, 


The voter identifier list of the version Vi 


On 


The j th voter who has voted the version Vi 


°ij 


The vote on V, given by the voter Oij 


OH x 


The vote history of the user X 


OMi 


The i th vote maintainer for a user 


Fi 


The i th friend of the requestor 



file (section l4~3b , and then selects an authentic version 
for downloading (section |4~4| |. While downloading, we 
employ a realtime verification mechanism to help the re- 
questor take prompt actions to handle content pollution 
(section l4~5b . Finally, we discuss how to deploy our de- 
sign in unstructured overlays (section l4~6*l >. In particular, 
the abstract execution process of Green is illustrated in 
Figure [U and the notation that we use to simplify the de- 
scription of our design can be found in Table [T] 

4.1 Content Publication 

As described in section[2] the underlying network struc- 
ture is assumed to be the Chord overlay, so we utilize 
SHA1 to assign each user and file an identifier. The user 
identifier of a participant is chosen by hashing the par- 
ticipant's IP address, while a file identifier is produced 
by hashing the filename. Recently, many users in P2P 
content sharing systems reside behind network address 
translators (NATs), hence a number of participating users 
may share the same public IP address. Moreover, we 
could not guarantee that each user's IP address stays con- 
stant in a dynamic P2P environment. That is, we are not 
able to distinguish between different users solely by their 
IP addresses. To solve the NAT and mobility problems, 
we may adopt another kind of unique and constant iden- 
tifiers (e.g., HIP-based identifiers 11271 ) to replace the IP- 
based identifiers, which is not the focus of this paper. 

Both the user identifiers and file identifiers are ordered 
in an identifier circle. Specifically, the information of a 



Table 2: The information of a specific file 





Version 
ID 


Digest 
List 


Provider ID 
List 


voter iu 
List 


1 


Vh 


DLi 


PILi 


OILi 


2 


VI 2 


DL 2 


PIL 2 


OIL 2 












i 


Vh 


DLi 


PILi 


OILi 












n 


Via 


DL n 


PIL n 


OIL n 



file F is assigned to the first user whose identifier is equal 
to or follows the F's file identifier in the identifier circle. 
This user is called the successor of file F, denoted by 
successor (F). Such structure tends to balance the load 
on the system, since each user maintains the information 
of roughly the same number of files; even if a user main- 
tains the information of a popular file, several existing 
techniques [15, 29] have provided the solutions of load 
balancing. 

In Green, a provider P, who wants to publish a spe- 
cific file F, first divides the file F into b data blocks, and 
then, he computes the digest of each block to construct 
a digest list, i.e., {Di}^ =1 . Here we could further utilize 
the homomorphic hashing technique ITHl to allow future 
file requestors to perform order-independent verification 
on these data blocks. Afterwards, according to F's file 
identifier, the provider P maps the information of file 
F (including the file identifier, digest list and his own 
user identifier) onto the associated maintainer M, i.e., 
successor(F) (step 1 in FigureQ]). Once the maintainer 
M has received such file information, he can additionally 
take hash of the entire digest list as a version identifier of 
file F. Moreover, in order to support the social-based 
reputation model (described in section I4.31 l. each user 
who has voted some versions of the file F should inform 
the associated maintainer M (step 1" in FigureHJ, so that 
the maintainer can know which users have voted the ver- 
sions of file F, i.e., the associated voters. In particular, 
since many providers may individually publish a number 
of versions of file F, the informatiory of F stored at the 
maintainer M is shown in Table|2] 

However, a maintainer may be offline and a malicious 
or compromised maintainer has the capacity of fabricat- 
ing/refusing/corrupting the file information maintained 
by himself. Therefore, we map the information of a spe- 
cific file F onto m maintainers {M;}™ 1 to provide the 
mechanisms of redundancy and reliability, i.e., each file 
F corresponds to a unique group of maintainers. 

Mi = successor(filename\i), 1 < i < m (1) 

1 In this paper, we omit the metadata-related information for clarity. 
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Here, M$ denotes the i th maintainer of the file F, 
filename denotes the name of the file, and m denotes 
the number of maintainers associated with the file. Due 
to the essential feature of structured overlay, these to as- 
sociated maintainers are distributed in the identifier circle 
uniformly at random; therefore, we can design a filtering 
scheme to exclude the malicious activities. 

4.2 Proactive Filtering 

Since the providers publish much information about their 
shared files, we could collect such information from the 
associated maintainers (step 2 in FigureQ}, and then pro- 
vide a proactive defense against the content pollution 
(step 3 in FigureQ~|i as follows. 

In Green, a requestor R, who wants to acquire a spe- 
cific file F, issues a query (i.e., F's file identifier) to- 
wards the to associated maintainers in the system ac- 
cording to the expression Q] Each of these maintainers 
should maintain the information of at least one version of 
the requested file F. Then, these to maintainers respond 
with the corresponding information of the requested file 
F (see Table [2]). On harvesting these responses, the re- 
questor R mixes them and generates a list L consisting 
of (VI, PI) pairs (do not remove duplicate pairs), where 
the VI and PI denote the version and provider identi- 
fiers associated with the requested file F, respectively. 

Based on the generated list L, we design a filtering 
scheme to exclude the malicious activities. Without loss 
of generality, we assume that the total percentage of ma- 
licious and compromised users is j3. Furthermore, as 
specified in section |4~T1 each file corresponds to a group 
of m maintainers. Due to the features of hash-based 
maintainer assignment (see expression [TJ, these main- 
tainers associated with the same file are distributed uni- 
formly at random. Therefore, the probability that x out 
of to maintainers in a group are malicious is 

Pr(m, x) = x j3 x x (1 - /3) m - x (2) 

Thus, the probability that less than half of these m main- 
tainers are malicious is 

TO-1 ^ 

Pr(ra, 0, [—^ — J) = Pr K^) (3) 

The expression [3] implies that though the malicious 
and compromised maintainers may collectively cre- 
ate/delete/modify their maintained (V/, PI) pairs, each 
(VI, PI) pair existing in the generated list L for no less 
than f 22 ^-] (i.e., to — L 12 ^]) times is authentic with 
the probability of Pr(m, 0, L^^J )■ In practice, the sys- 
tem designer would need to estimate the worst network 
environment that the system should sustain in order to 



determine the corresponding l3 worst . Then, according to 
the fi W orst and the expected communication overhead, 
the system designer could further tune the parameter to 
to guarantee that Pr(?n, 0, L 12 ^] ) is large enough even 
in the worst environment. For instance, if f3 worst = 0.2 
and to = 5, the Pr(m, 0, ) will be 94.2%, which 

is sufficient for common systems. In other words, by ap- 
plying the suitable system parameters, the requestor R 
can safely employ a proactive filtering mechanism here 
by filtering the aforementioned list L through removing 
the (VI, PI) pairs with less than f 12 ^-] times. 

In the same way, the requestor R could filter out the 
malicious (VI, OI) pairs, where the VI and OI denote 
the version and voter identifiers associated with the re- 
quested file F, respectively. Moreover, the malicious 
maintainers may choose to corrupt a version identifier 
Vh or a digest list DLi of the file, but this generally 
could not take effect since the requestor is able to de- 
tect such corruption by simply hashing the digest list and 
then comparing the result with the corresponding version 
identifier. Certainly, these malicious and compromised 
maintainers may choose to corrupt a version identifier 
Vli and the corresponding digest list DLi simultane- 
ously; if so, the requestor could also utilize the above 
proactive filtering mechanism to filter out the corrupted 
information. After all the previous kinds of filtering, the 
requestor excludes these filtered information, and groups 
the remainder to locally reconstruct the information of 
the requested file F (see Table|2]). 

4.3 Social-Based Reputation Model 

Reputation model is a common technique to address the 
problem of content pollution. In this section, we first de- 
sign a basic reputation model, and then we use the social 
information to enhance its effectiveness and efficiency. 

4.3.1 Basic Reputation Model 

Based on the proactive filtering mechanism, the re- 
questor R can obtain the authentic information of the 
requested file F by filtering out the malicious activities 
performed by malicious and compromised maintainers. 
However, polluters could choose to simply inject pol- 
luted versions of the file F without corrupting the main- 
tained information. Due to the lack of authenticity verifi- 
cation, the requestor R generally cannot know which ver- 
sions have been polluted. Here, we resort to the general 
reputation-based information to further help requestors 
identify the polluted versions. 

Vote Gathering (step 4 in Figured). The requestor R 
first extracts each version Vi's associated voter identifier 
list OILi from the reconstructed information of file F 
(see Table |2]). Then, the requestor traverses OILi, and 
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contacts each voter Oij indicated by 01 Li to gain the 
voter's past vote history OHij . Once receiving these vote 
histories, the requestor can easily obtain each voter Oij 's 
vote Oij on the version Vi. The vote may be either — 1, if 
the voter considers the version polluted, or 1, otherwise. 

In the above simple vote gathering mechanism, vot- 
ers are required to remain online, otherwise their votes 
would be lost. To solve this problem, besides the locally 
maintained vote history, we employ a redundancy mech- 
anism (similar to the expression!!} to map each user ITs 
vote history onto a group of m vote maintainers (steps 1" 
and 1"' in Figure [T). 



OMi = successor (U 1 < i < m 



(4) 



Here, OMi denotes the i th vote maintainer for the user 
U, UI denotes the identifier of the user, and m denotes 
the number of vote maintainers for the user. Now, no 
matter the voter Oij is online or offline, the requestor R 
has the capacity of obtaining Oij 's past vote history from 
the associated vote maintainers, and then filtering out the 
malicious activities performed by malicious and compro- 
mised vote maintainers as described in section l4~2l Note 
that, when the number of a version's associated voters is 
too large, the requestor should choose a set of randomly- 
selected voters to perform the vote gathering in order to 
control the communication overhead. 

Reputation Computation (step 5 in Figured)- With 
the vote gathering mechanism, the requestor R could ob- 
tain each associated voter Oy's vote o^ on the version 
Vi. Based on these votes, the requestor is able to com- 
pute the reputation score of Vi. The simplest way is to 
execute the unweighted averaging on these votes; how- 
ever, it cannot distinguish between different voters, e.g., 
both like-minded voters and malicious voters are treated 
equally. Instead, in our design, we compute a Credence- 
style ||32l weight coefficient 9 for each vote and execute 
the weighted averaging to compute the reputation score. 
The computation of weight coefficient is described as 
follows: 



p—ab 



y/a(l-a)b(l-b) 



if 



j—ab 



> 0.5 







y/a(l-a)b(l-b) 

if I , " =i < 0.5 



y/a(l-a)b(l-b) 1 



(5) 



Here, given the set of versions on which both R and Oij 
have voted, a and b are the fraction where R and Oij 
voted positively, respectively, and similarly p is the frac- 
tion where both voted positively. Several abnormal cases 
that arise in practice are discussed in ll32l . 

The weight coefficient 0^ ROij ) expresses the statis- 
tical correlation between the two users' vote histories, 
and captures whether they tend to vote identically (i.e., 



7 («,o, 3 ) 



> 0.5), inversely (i.e., 



< —0.5) or un- 



correlatedly (i.e., \0mOi)\ < 0-5). Based on the weight 
coefficient, the requestor performs the weighted averag- 
ing to compute the reputation score of each version Vi . 



Rep(Vi) 



^3 = 



OILi\ 



1 (R,O i 



e[-M] (6) 



Here, \OILi\ denotes the size of the associated voter 

identifier list of Vf, moreover, if Yl\°Ji^ l^(fl,O y )l = 0' 
then Rep(Vi) = 0. This weighted averaging scheme 
gives more weight to votes from like-minded voters, and 
it can be used to assist requestors in better identifying 
polluted versions. 



4.3.2 Social-Based Enhancement 

Currently, many P2P systems have already merged the 
social information, e.g., the friend links, for some pur- 
poses. These friend links are very different from the 
overlay links, and they could be created in various ways. 
For instance, the friends may be these real-world ac- 
quaintances, or many more. A user and his friends gen- 
erally share a similar interest and give similar votes on 
a specific version; moreover, the friends are much more 
trustworthy than other common users. By exploiting the 
information of friends, we can provide two kinds of en- 
hancement for the above basic reputation model: vote 
history extension and efficiency improvement. 

Vote History Extension. In a P2P content sharing 
system, a requestor may have only a few local vote his- 
tory. This would make the requestor cannot accurately 
compute the correlation between each associated voter 
and himself, thus influencing the performance of our rep- 
utation model. To extend the local vote history, the re- 
questor could additionally consider his friends' vote his- 
tories before performing the reputation computation. 

We assume that the requestor R with local vote history 
OHr has / friends in the system, denoted by {Fi}{ =1 ; 
moreover, each friend Fi has the vote history OHp i . In 
our vote history extension scheme, the requestor first col- 
lects his friends' vote histories from the associated vote 
maintainers, by filtering out the malicious activities per- 
formed by malicious and compromised vote maintainers. 
Then, the requestor adopts an attenuation coefficient 7 to 
weigh his friends' vote histories. Finally, the requestor 
performs a simple nonzero averaging on these friends' 
vote histories as well as his local vote history to compute 
the extended vote history OH' R . 



OH' R = avg (OH R , OH Fl x 7, ■ ■ • , OH Ff x 7) 



avg 



(0H R , {OH Fi x 7 }f =1 



(7) 
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Here, each vote history is treated as a vector, and the 
avg is defined to be a function of computing, in turn, 
the averages of nonzero values at each position in these 
vectors. 

To illustrate the computation process intuitively, we 
give an example. Suppose a requestor R with local vote 
history OH R = {1,-1, 0, 0, -1, 0} has two friends F x 
and F2 in the system. So, he is able to collect the two 
friends' vote histories OHp 1 and OHp 2 from the associ- 
ated vote maintainers. 

OH Fl ={0,0,1,0,-1,-1} 
OH F2 ={1,0,1,0,-1, 0} 

In practice, the vote history can be stored and transmitted 
in a more compact way. If the attenuation coefficient 7 
is set to 0.9, then the final extended vote history OH' R is 

Cijji — r 1 + 1x0.9 -1 1x0.9+1x0.9 n 
R I 1+0.9 ' 1 ' 0.9+0.9 ' U ' 

(-!) + (-!) x0.9+(-l)x 0.9 (-l)x0.9 -| 
1+0.9+0.9 ' 0.9 J 

= {1,-1,1,0,-1,-1} 

Due to the fact that the friends are generally trustwor- 
thy, the requestor could apply the vote history extension 
to enrich his vote history reliably; therefore, he is able 
to perform the reputation computation more accurately. 
Note that, if a requestor R has many malicious friends, 
the extended vote history may be polluted; here, some 
existing mechanisms, e.g., SybilGuard [36 1 and Sybil- 
Limit [35 1, could be used to improve the trustworthiness 
of friends. Moreover, if the requestor does not have any 
friends in the system, our social-based reputation model 
falls back to the basic form. In some sense, the requestor 
should "pay the price" for having no friends or having 
many malicious friends. 

Interestingly, our proposed vote history extension 
scheme can not only improve the accuracy of reputation 
computation, but also solve the "cold start" problem, i.e., 
a newly incoming user (i.e., newcomer) without any vote 
history cannot benefit from the reputation model. To 
address the "cold start" problem, once joining the sys- 
tem, the newcomer first builds up the friend links. Then, 
he collects his friends' vote histories to compute the ex- 
tended vote history as described above. Specifically, this 
extended vote history could be treated as the newcomer's 
initial local vote history, so that he could perform the rep- 
utation model normally. 

Efficiency Improvement. From the previous descrip- 
tion of vote history extension, each time a requestor exe- 
cutes the reputation computation, he should collect the 
vote histories from his friends' associated vote main- 
tainers, which is somewhat expensive. In Green, each 
user periodically collects his friends' vote histories from 
these associated vote maintainers, and meanwhile, each 



user maintains a friend-vote database to store the lat- 
est collected vote histories (step 1"" in Figure [TJ. With 
this kind of database, the requestor is able to retrieve his 
friends' vote histories locally. 

Secondly, due to the fact that a user and his friends 
generally have the similar interest, they may give the 
similar votes on a specific version. If many friends of 
a requestor have already voted a version, the requestor 
does not need to compute the reputation score of the 
version again; as an alternative, he could utilize these 
friends' votes to efficiently estimate the reputation score. 

Without loss of generality, we assume that the re- 
questor wants to obtain the reputation score of a specific 
version Vi, and moreover he has / friends, among which 
there are /' friends {-Fj}J =1 who have already given the 

votes {oij}j =1 on the version Vi. Note that, such in- 
formation can be retrieved from the locally maintained 
friend-vote database. Specifically, if only a few friends 
have voted the version (i.e., /' is small), the associated 
votes may be biased, so now we have to return back to 
use the normal social-based reputation model described 
before; otherwise, if a sufficient number (/' > 4 in 
our design) of friends have voted the version, we utilize 
the associated votes to efficiently estimate the reputation 
score Rep'(Vi) of the version Vi. 

Rep'iVi) = x 7 (8) 

Here, the parameter 7 denotes the attenuation coefficient. 
If |i?ep'(Vi)| > 0.5, we consider that these friends voted 
the version Vi identically, so the Rep'(Vi) can be treated 
as the final reputation score Rep(Vi) of the version Vf, 
otherwise, there are significant differences among these 
/' associated friends, so we again return back to use the 
normal social-based reputation model described before. 

With low probability, a malicious user may masquer- 
ade as genuine user to become the requestor's friend, 
and a friend may also be compromised. To address this 
"malicious friend" problem, before performing the ef- 
ficient reputation estimation, the requestor should first 
compute the correlation coefficient 6 between each asso- 
ciated friend and himself, as specified in expression [5] 
If 6 < 0.5, the friend may be malicious or uncorre- 
cted [32, so we choose not to take this friend's vote 
into account. Due to the periodically updated friend-vote 
database, these correlation coefficients could be com- 
puted periodically instead of repeatedly. 

In summary, we use the social (friend-related) infor- 
mation a) to provide a vote history extension mechanism 
to enhance the effectiveness of basic reputation model 
even for a system newcomer, and b) to further enhance 
the efficiency of both the vote history collection and rep- 
utation computation. 
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4.4 Version Selection 

After the proactive filtering (section 14. 2\ and social- 
based reputation computation (section FOl ). the requestor 
has obtained two kinds of information about each version 
Vi of his requested file: the authentic provider identifier 
list PILi and the reputation score Rep(Vi). Based on 
such information, the requestor is able to select a version 
for downloading (step 6 in Figure[T|i. 

Since the size of the associated provider identifier list 
generally reflects the availability of a specific version, in 
many cases, the requestor R selects the version with the 
largest provider identifier list. However, the analytical 
model presented in |[T3l characterized the impact of var- 
ious selection strategies, and implied that the "largest" 
strategy is vulnerable to the pollution attack as well as 
the Denial-of-Service attack. As an alternative, in Green, 
the requestor i?'s choice is biased towards versions with 
more providers. Furthermore, the reputation score of a 
version indicates the authenticity of the version, so the 
probability of selecting a version should also be propor- 
tional to the associated reputation score. Considering the 
above factors, we get the following expression: 



Pr(Vi) 



\PILj\ x RepjVj) 
Ei=t (\PILj\ x RepJVj 



Ki<n 



(9) 



Here, Pr(V^) denotes the probability of selecting the ver- 
sion Vi for downloading, \PILi \ denotes the size of the 
provider identifier list of Vi, and n denotes the number 
of versions associated with the requested file. Specifi- 
cally, as the original Rep(Vi) G [—1, 1], we make a linear 
mapping and let Rep{Vi) — Rep(Vi) + 1 to ensure each 
Pr(^) > 0. Note that, when the denominator of the 
expression is 0, the requestor has to perform an ad hoc 
version selection; however, this might happen only when 
all versions' original reputation scores are —1, which in- 
dicates that there is no authentic version of the requested 
file existing in the system. 

4.5 Downloading with Realtime Verifica- 
tion 

The requestor R starts downloading the data blocks of 
the selected version from these associated providers in 
parallel. Once the requestor receives a data block, he first 
locally computes a digest D by hashing the received data 
block, and then, matches D against the digest existing 
in the corresponding digest list of the requested F's file 
information (reconstructed based on the proactive filter- 
ing as described in section |4~2| |. If they are matched, the 
received data block is accepted, otherwise the block is 
dropped. These operations make the requestor have the 



capacity of verifying the integrity of the selected version 
while downloading (step 7 in Figure [T). 

In common as described above, aiming at verifying the 
data integrity, the requestor R verifies all the received 
data blocks while downloading. To reduce the verifi- 
cation overhead, we propose a probabilistic verification 
scheme which verifies randomly selected blocks instead 
of all blocks. 

In this scheme, the requestor R assumes that a pol- 
luter should tamper with at least r blocks of all b blocks, 
and moreover, less than r polluted blocks could be suc- 
cessfully recovered by the error-correcting code (ECC) 
technique. Based on this assumption, the requestor 
R would need to determine the expected false positive 
rate (EFPR) of the probabilistic verification according to 
the tradeoff between integrity assurance and verification 
overhead. That is, 

f (VP _ (b-r)<x(b-v)\ ifr + v<b 

FPR= {IIT ~ (b-r-v)\xb\ nr+uso 

[o \fr + v>b ( 10 ) 

< EFPR 

Here, FPR and EFPR denote the actual and expected 
false positive rate, respectively; moreover, b denotes the 
total number of data blocks, v denotes the number of ver- 
ified data blocks, and r denotes the lower bound of the 
number of polluted data blocks. Now, the requestor R 
can randomly verify the minimal number of data blocks, 
denoted by v m i n , that satisfies the above expression [TUl 
in order to reduce the verification overhead. 

Finally, after the downloading and verification, the 
requestor builds the complete version from these data 
blocks. If the built version is authentic, the requestor 
gives/publishes a positive vote on the version, and then 
shares the downloaded version; otherwise, if the ver- 
sion is unfortunately found to be polluted, the requestor 
should delete it, give/publish a negative vote, and then re- 
peat from the step of version selection (section l4~4i i until 
he receives the authentic version or the authentic version 
is found to be non-existent in the system. 

4.6 Deployment in Unstructured Overlay 

In the previous sections, we focus on the design of Green 
in structured overlay systems; to illustrate the broad ap- 
plicability, we further describe how to deploy Green in 
unstructured overlay systems. 

Generally, a unstructured overlay system is composed 
of users joining the overlay with some loose rules. To de- 
ploy our design, we introduce a two-tier structure into the 
unstructured overlay: a subset of users (i.e., ultra-users) 
construct a structured mesh, while the other users (i.e., 
leaf-users) connect to the ultra-user tier for participating 
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into the overlay system. Specifically, we adopt the elec- 
tion algorithm in ll22ll to select the ultra-users, and then 
utilize our proposed design to construct the structured 
mesh among ultra-users, so that the unstructured over- 
lay systems could also benefit from our defense mecha- 
nisms against content pollution. In practice, many mod- 
ern unstructured P2P overlay systems have already im- 
plemented some similar two-tier structures, so that the 
deployment of Green in unstructured overlay systems 
will not make many modifications to their original net- 
work structures. Consequently, Green is broadly appli- 
cable, and it is able to work for both structured and un- 
structured overlays. 

5 System Analysis 

A well-designed pollution-free P2P content sharing sys- 
tem should seek to optimize its defense capacity and 
maintenance overhead under the two kinds of common 
content pollution attacks discussed in section[3] 

5.1 Defense Capacity 

Decoy Insertion. Under a decoy insertion attack, the 
polluters inject a massive number of corrupted versions 
(with the same metadata but different identifiers) for a 
target file, in order to reduce the availability of authentic 
versions of the file. In Green, we propose a social-based 
reputation model to defend against such decoy insertion. 

In a P2P content sharing system, the genuine users 
give positive votes on the authentic versions, and neg- 
ative votes on the polluted versions; on the contrary, in 
order to spread the polluted versions, the polluters give 
the opposite votes to value the polluted versions instead 
of authentic versions. However, this kind of malicious 
voting operations will notably impact the correlation be- 
tween polluters and genuine users (see expression[5]l, and 
make genuine users give negative weights (0 < 0) to 
the votes from polluters, i.e., a negative vote from pol- 
luters actually values a version (see expression^. Even 
if the polluters apply a tricky voting strategy by giving 
correct votes sometimes, this will also not influence the 
system performance significantly because the genuine 
users would mark the votes from these tricky polluters 
as uncorrected and then directly drop them. Moreover, 
each requestor considers his friends' votes to extend the 
vote history reliably (see expression |7j, so that the re- 
questor could perform the reputation computation more 
accurately. To enhance the efficiency, if many friends of 
a requestor have voted a specific version, the requestor 
further utilizes his friends' votes to quickly estimate the 
reputation score of the version (see expression[8]). 

Consequently, though polluters inject many corrupted 
versions of a target file into the system and perform the 



malicious voting, the genuine users are still able to select 
the authentic versions for downloading effectively and 
efficiently. 

Identifier Corruption. To launch an identifier corrup- 
tion attack, the polluters should first analyze the prop- 
erties of the hash function used by a P2P content shar- 
ing system. Then, they accurately corrupt a fraction of 
data (e.g., the data that are not used) without changing 
the identifier of the corrupted version. Unfortunately, the 
reputation model cannot identify such attacks due to the 
unchanged version identifier. In our design, we propose 
a block-based verification mechanism to defend against 
the identifier corruption. 

In Green, a maintainer manages many digest lists, each 
of which is associated with a maintained version. A 
requestor could obtain the digest lists, and then utilize 
them to verify the integrity of data blocks while down- 
loading (see section 14.51 ). Due to the fact that each 
data block is used to generate the digest list and each 
block digest is verified, it is very difficult for polluters 
to perform the identifier corruption attack. Furthermore, 
we design a probabilistic verification mechanism to re- 
duce the verification overhead. Here, each requestor 
randomly verifies a number of data blocks of the se- 
lected version (see expression [TOb. so that the polluters 
cannot always launch successful identifier corruption at- 
tacks by corrupting some fixed data blocks. Indeed, the 
probabilistic verification may impact our defense capac- 
ity against identifier corruption. There is a tradeoff be- 
tween integrity assurance and verification overhead. If a 
requestor wants to completely defeat the identifier cor- 
ruption, he should verify all the downloaded data blocks. 

5.2 Maintenance Overhead 

To support the running of Green, a participating user 
plays two roles: file information maintainer and vote 
maintainer; and meanwhile, each user concretely main- 
tains three information sets: a set of file information, a set 
of common users' vote histories, and a set of his friends' 
vote histories. Specifically, in Green, the maintained in- 
formation is distributed averagely, and it will not incur 
much maintenance overhead on the participating users. 
The reason is twofold. 

Firstly, since we use the hash function to perform the 
assignment of file information maintainers, each user 
maintains the information of roughly the same number of 
files, thus achieving a good load balancing. As shown in 
Table |21 the information of a file is composed of a num- 
ber of (Vli, DLi, PILi,OILi) quaternions. In each 
quaternion, the version identifier Vli is a hash value; the 
digest list DLi is a list of data block digests, here the 
system designer could control the size of DLi through 
bounding the number of divided data blocks of a file; 
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moreover, both PI Li and OILi merely store the asso- 
ciated provider identifiers and voter identifiers, respec- 
tively, which will not aggravate a user's maintenance bur- 
den significantly. In addition, even if a user maintains the 
information of a popular file (with a large number of ver- 
sions, providers and voters), several techniques ifTTl |29l 
could be employed here for load balancing. 

Secondly, due to the hash-based vote maintainer as- 
signment and the redundancy mechanism, each user 
maintains roughly m common users' vote histories on 
average; furthermore, each user maintains a friend-vote 
database consisting of his friends' vote histories. Gener- 
ally, maintaining these vote histories will also not aggra- 
vate the burden of a user significantly. 

6 Evaluation 

In this section, we first describe the experimental setup, 
and then we present the key performance metric. Finally, 
we comparably evaluate the performance of Green. 

6.1 Experimental Setup 

To evaluate the performance, we developed a prototype 
system implementing all our proposed security mech- 
anisms with approximately 5040 lines of Java code. 
Specifically, all the experiments are run on an Open- 
Power720 with four-way dual-core POWER5 CPUs and 
16GB RAM running SLES 9.1. 

Network Model: We use two realistic massive-scale 
network traces [25|, YouTube (1,157,827 users and 
4,945,382 friend links, Jan 2007) and Flickr (1,846,198 
users and 22,613,981 friend links, Jan 2007), to gener- 
ate the social/friend networks. Specifically, since these 
traces do not indicate which users are genuine or mali- 
cious, we have to perform an appropriate processing on 
the two generated social networks. 

According to the small-world property of online social 
networks (|5]|25|, we generate 50 genuine user groups for 
each generated social network as follows. We first ran- 
domly select 50 bootstrapping users as the seeds of these 
genuine user groups, and then use the breadth-first search 
algorithm to extend these genuine user groups to guaran- 
tee that each group contains a different number of unique 
users (Zipf distribution with its parameter a = 1). Note 
that, if we cannot generate a required number of genuine 
users from a bootstrapping user, we randomly select an- 
other bootstrapping user to replace this one. Except these 
users contained by genuine user groups, the other users 
are polluters. 

We choose Chord PD as the underlying P2P overlay 
of Green, and all the users participate in the Chord over- 
lay to construct a structured overlay network, so that we 



could use it to assign/lookup the file information main- 
tainers, the vote maintainers, etc. 

Content Model: The previous study in [20 1 reported 
the existence of a large number of polluted versions for a 
single file. In our experiments, there are 10, 000 unique 
files, each of which has 100 different versions. Since we 
do not model the corruption activities performed by ma- 
licious maintainers, the information of a file is stored at 
only one information maintainer, and similarly the vote 
history of a user is also stored at one vote maintainer. 
Moreover, each version is associated with a number of 
providers which vary over time. 

At the start of our experiments, each genuine user 
shares 20 authentic versions, and each polluter shares 
400 and 50 polluted versions when performing decoy in- 
sertion and identifier corruption, respectively; this mod- 
els a highly malicious environment. Here, the versions 
shared by a user are determined by first selecting a cer- 
tain file and then its version. Specifically, both selections 
follow Zipf distribution with a = 0.8 ll20l . 

Execution Model: Different queries are initiated at 
uniformly distributed users, and the attenuation coeffi- 
cient 7 is set to 0.9. Specifically, an experiment is com- 
posed of 50 experimental cycles, and each experimental 
cycle is divided into 5,000,000 query cycles. In each 
query cycle, the selection of a specific version to down- 
load is done by first selecting a file according to Zipf 
distribution with a = 0.8, and then choosing a version 
based on our proposed schemes. After each experimental 
cycle, the number of authentic downloads is computed. 

For a genuine user, he gives a vote on the downloaded 
version with a probability of P a . Here, we add noise 
to model users' mistakes through making genuine users 
vote correctly with only 90% probability. Further, if a 
downloaded version is polluted, the genuine user deletes 
this version with a probability of Pd- On the other hand, 
for a polluter, he always gives a malicious (i.e., oppo- 
site) vote on the downloaded version, and then shares it. 
In addition, each user collects his friends' vote histories 
from the associated vote maintainers and updates the lo- 
cal friend-vote database once per 100, 000 query cycles. 

6.2 Performance Metric 

In our experiments, we characterize the system perfor- 
mance using the fraction of authentic downloads. It is 
defined as the fraction of downloads that the genuine 
users acquire authentic versions during one experimen- 
tal cycle. 

6.3 Evaluation Results 

In this section, we evaluate the performance of Green 
under decoy insertion and identifier corruption, by con- 
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Figure 2: Evaluation results under the decoy insertion attack 



Table 3: Experimental Notation 



Symbol 


Meaning 


P 


The total fraction of polluters 


Po 


The probability that a genuine user 




gives a vote on the downloaded version 


Pd 


The probability that a genuine user 




deletes the downloaded polluted version 


FPR 


The false positive rate of probabilistic 




verification 



sidering four parameters: /3, P a , Pd and FPR (see Ta- 
bleEJ. Their default values are 0.2, 0.9, 0.9 and 0.133, 
respectively. Specifically, in each experiment, we may 
vary only one parameter and leave the others default. 

6.3.1 Under Decoy Insertion Attack 

Comparison. Using two realistic traces, we evaluate the 
performance of Green as compared to four other systems 
with the default parameters. 

• Baseline: The probability of selecting a version 
Vi to download is proportional to \PILi\ (i.e., the 
number of Vi's associated providers). 

• Credence: An ingenious and representative pollu- 
tion defense system deployed on live networks (32]. 

• Social-based Credence: The Credence system with 
our social-based enhancement. 

• Non-social Green: The Green system without our 
social-based enhancement. 

As shown in Figures [2a] and |2bJ Green outperforms 
all other systems with only 3% of all the downloaded 
versions being polluted; moreover, the "Baseline" can- 
not work well. Specifically, two systems with social- 
based enhancement, i.e., Green and "Social-based Cre- 
dence", converge faster than "Non-social Green" and 
"Credence". This is because that, though each user has 
only a small number of vote history at the startup, the 



user in social-based systems could additionally consider 
his friends' vote histories to help judging the version au- 
thenticity more accurately. 

Furthermore, we could find that the two systems with 
social-based enhancement, i.e., Green and "social-based 
Credence", in the network with Flickr trace converge a 
little faster than in the network with YouTube trace. The 
reason is that each user in Flickr has 12.25 friend links 
on average which is larger that 4.27 in YouTube. This 
phenomenon also implies that, in order to achieve a good 
performance, a user in Green actually does not need too 
many friends. 

Since Green always outperforms the four other sys- 
tems in the following experiments and it performs simi- 
larly in the two networks with YouTube and Flickr traces, 
respectively, we hereafter choose to merely present 
the experimental results of Green in the network with 
YouTube trace for saving the space. 

Impact of Polluters. In this experiment, we investi- 
gate the influence incurred by different fractions of pol- 
luters (/3). As shown in Figure [2c] the performance of 
Green decreases slightly with the growth of the fraction 
of polluters, and Green can work well even in a highly 
malicious environment with 40% of users being pol- 
luters. Here, the slight performance decrease is mostly 
due to the fact that, with the increase of polluters, both 
the fraction of genuine users' malicious friends and the 
number of polluted versions increase in the system. Also, 
the result implies that if a user could ensure most of his 
friends being genuine, he is able to select an authentic 
version to download with very high probability; this ac- 
tually provides a strong incentive for genuine users to 
maintain the trustworthiness of their friends. 

Impact of Voting. Figure [2d] shows that Green can 
work well with various different voting probabilities 
(P„); moveover, even if genuine users are very inactive to 
vote the downloaded versions (P a = 0.2), Green can still 
work well. This benefits from our vote history extension 
scheme which could extend a user's vote history by con- 
sidering his friends' vote histories. Specifically, in cur- 
rent P2P content sharing systems, only a few users may 
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Figure 3: Evaluation results under the identifier corruption attack 



vote the downloaded versions [20|, so the above experi- 
mental result further indicates that Green can indeed be 
deployed in practice. Note that, if P a — 0, Green would 
fall back to a Baseline-like system and any existing rep- 
utation systems cannot work, under decoy insertion. 

Impact of Deleting. We further evaluate the system 
performance with different deleting probabilities (Pd)- 
In Green, the deleting strategy impacts the number of a 
polluted version's associated providers, thus influencing 
the probability of selecting this version to download in 
the future (see expression |9). Hence, we would expect 
that the system performance increases with the growth 
of deleting probability. 

Interestingly, different deleting probabilities have, 
however, no significant impact on the performance of 
Green. Our analysis takes three factors into considera- 
tion. Firstly, a user may activate his own deleting module 
only after he has already downloaded a polluted version. 
Secondly, Green is able to identify most of the polluted 
versions, and therefore, a genuine user downloads pol- 
luted versions with very low probability. Thirdly, even 
if a genuine user has downloaded a polluted version and 
does not delete it, he may give a negative vote on this 
version; thus, other genuine users could identify this pol- 
luted version based on the negative vote. As a result, 
no matter which deleting strategy is adopted by these 
genuine users, Green could always work well under de- 
coy insertion attacks. Specifically, Green also keeps this 
property under identifier corruption attacks. Due to the 
space limit, we omit the two corresponding figures. 

6.3.2 Under Identifier Corruption Attack 

Comparison. In this experiment, we comparably eval- 
uate the performance of Green in the network with 
YouTube trace. Figure [3a] illustrates that Green out- 
performs all other four systems. Moreover, the three 
systems, i.e., "Baseline", "Credence" and "Social-based 
Credence", which do not perform verification while 
downloading cannot defend against identifier corrup- 
tion effectively; this validates the effectiveness of our 



proposed realtime verification mechanism. In addition, 
the experimental result also indicates that the effect of 
social-based enhancement is limited under the identifier 
corruption attack. 

Due to the space limit and the similar performance in 
both networks with the YouTube and Flickr traces, re- 
spectively, we omit to present the system performance 
using Flickr trace, and hereafter choose to merely present 
the experimental results of Green in the network with 
YouTube trace. 

Impact of Polluters. We vary the fraction of polluters 
((3), and investigate the influence incurred by polluters. 
Figure [3b] shows that Green can defeat identifier corrup- 
tion even in a highly malicious environment with 40% of 
users being polluters. In real-world P2P systems, some 
polluters may perform Sybil [ 12 1 attacks to create a large 
number of virtual malicious users, and the above exper- 
imental result implies that Green could also work well 
under the Sybil attack. Of course, we are able to incor- 
porate Green with some other Sybil defenses, e.g., Sybil- 
Guard [36 1 and SybilLimit ll35l . to achieve a better sys- 
tem performance. 

Impact of Voting. As shown in Figure [3c] we eval- 
uate the system performance with the voting probability 
(P ) varying in steps of 0.2. The experimental result il- 
lustrates that the voting probability has little effect on the 
system performance, i.e., our system always works well 
no matter which voting strategy the genuine users adopt. 
The reason of this phenomenon is that Green's defense 
capacity against identifier corruption is mostly provided 
by the realtime verification mechanism which does not 
utilize the vote history. This experimental result further 
validates that our proposed mechanism is applicable in 
real-world P2P systems where only a few users may give 
votes on the downloaded versions. 

Impact of False Positive Rate. In Green, each user 
is able to flexibly adjust the false positive rate (FPR) 
of probabilistic verification. The experimental result in 
Figure[3d]shows that the false positive rate of probabilis- 
tic verification impacts our defense capacity against the 
identifier corruption attack. There is a tradeoff between 
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integrity assurance and verification overhead. The more 
data blocks a user verifies, the better defense capacity 
he can gain; whereas the pattern is reversed. This result 
validates our analysis in section |5T| 

7 Related Work 

7.1 Pollution Defenses 

So far, many reputation models have been proposed to 
address the problem of content pollution in P2P content 
sharing systems. In general, these reputation models can 
be grouped into three main categories: peer-based mod- 
els, object-based models and hybrid models. 

In peer-based reputation models, e.g., EigenTrust ifTTl , 
PeerTrust 11331 and Scrubber |9], to reflect the level of 
honesty, each participating user is assigned a reputation 
score by considering his past behaviors in pairwise trans- 
actions. According to the reputation score, genuine users 
could collectively identify content polluters, and then 
isolate these polluters from the system. However, the 
studies in lfl3l[32ll evaluated the potential impact of peer- 
based reputation models, and implied that these models 
are insufficient to defend against the pollution attack. 

Among the object-based reputation models, Cre- 
dence 1 32 1 is the typical representative of them. In Cre- 
dence, genuine users determine the object authenticity 
through secure tabulation and management of endorse- 
ments from other users. This model utilizes the statistical 
correlation to measure the reliability of users' past votes, 
and designs a decentralized flow-based trust computa- 
tion to discover trustworthy users. However, a newcomer 
without vote history hardly has any capacity of distin- 
guishing between authentic objects and polluted objects. 

Aiming at combining the benefits of both peer-based 
and object-based models, several hybrid reputation mod- 
els, e.g., XRep 0J], X 2 Rep (To) and extended Scrub- 
ber |8), have been further presented. Nevertheless, due 
to the fact that most of the participating users in P2P con- 
tent sharing systems are rational in seeking to maximize 
their individual utilities, the reputation models are penal- 
ized by the lack of reliable user cooperation. 

Currently, without resorting to reputation models, sev- 
eral other pollution defenses have been proposed. Micro- 
payment techniques, e.g., MojoNation |3| and PPay 11341 . 
can be utilized to counter the pollution attack by impos- 
ing a cost on content polluters — to inject polluted con- 
tent into the system they should first commit a certain 
amount of resources. Furthermore, Habib et al. in |fl6l 
developed an integrated security framework to verify 
content integrity in P2P media streaming. This frame- 
work could provide high assurance of data integrity with 
low computation and communication overheads; how- 
ever, it requires centralized supplying peers. In 11231 . 



Michalakis et al. presented a "Repeat and Compare" 
system to ensure content integrity by utilizing the P2P 
substrate to repeat content generation on other peers and 
compare the results to detect content pollution. This sys- 
tem focuses on the replication of computations instead 
of data replication and comparison, and its application 
field is generally the P2P content distribution networks. 
Recently, Chen et al. in J7j proposed a new pollution de- 
fense based on the notion that the content providers are 
the only sources to accurately distinguish polluted con- 
tents and verify the integrity of the requested contents; 
however, it cannot defeat decoy insertion effectively. 

7.2 Social Networking 

Social networking sites such as YouTube, Facebook, 
MySpace, Orkut and Flickr are experiencing explosive 
growth, both in terms of involved communities and over- 
all participating users. The social networking technique 
has significantly impacted how Internet users make use 
of the today's Internet. So far, several proposals have 
been made to improve existing distributed systems by ex- 
ploiting the inherent properties of social networks. 

PGP 1 37 ] is one of the early social networking appli- 
cations, in which the participants create a "web of trust" 
to authenticate public keys based on their acquaintances' 
opinions in a fully self-organized manner. This "web of 
trust" model utilized the friend-of-friend trust structure, 
and then Garriss et al. in |[T4l adopted the similar struc- 
ture to develop the "Reliable Email (Re:)", an automated 
whitelisting-based email acceptance system, to mitigate 
email spam. Re: exploits social relationships among 
email correspondents, and moreover, it does not incur 
false positives among socially connected users. In (24), 
Mislove et al. tried to combine the information contained 
in both hyperlinks and social links, so they merged the 
social networking technique into the Web search engine 
to optimize the ranking results by considering various in- 
terests of different users. 

To defend against Sybil attacks |[T2l . two promising 
decentralized social-based protocols, SybilGuard 11361 
and SybilLimit [35], have been proposed. They utilize 
the social networks among user identities based on the 
fact that Sybil users could create many identities but few 
trust relationships; finally, the two protocols have the ca- 
pacity of allowing only very limited Sybil users to be 
accepted even in a large-scale network. Furthermore 
in 1 26 1, based on the similar fact that it is difficult for 
a user to create arbitrarily many trust relationships, the 
Ostra system explored the use of existing social links 
to impose a cost on the information senders, thus pre- 
venting the adversary from sending excessive unwanted 
communication. Recently, Ramachandran and Feamster 
in [28 1 proposed a framework called Authenticatr to es- 



13 



tablish authenticated out-of-brand communication chan- 
nels between applications by utilizing social links exist- 
ing in various social networking sites. 

8 Conclusion 

In this paper, we design a pollution-free P2P content 
sharing system, Green. In Green, a content provider 
(i.e., creator or sharer) publishes the information of his 
shared contents into a security overlay; in order to ob- 
tain the authentic contents, a content requestor can uti- 
lize the inherent content-based information to perform 
the proactive filtering and realtime verification, and uti- 
lize the traditional reputation information and the social 
information to perform the social-based reputation com- 
putation. Green is broadly applicable for both structured 
and unstructured overlay applications. The analysis and 
prototype-based evaluation (in massive-scale networks 
with realistic traces) indicate that Green is able to defend 
against the content pollution effectively and efficiently. 
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